In this document we estimate the talent pool factoring in the changing levels of interest in the game. This document will depart from the calculations made in and . We will take a more exacting look at the talent pool and historical baseball interest than those prior references.
Baseball started as a game played by people largely from the Northeast (NE) and Midwest (MW) regions of the United States of America. Our construction of the talent pool will be pegged to the population of age 20-29 males from the NE and MW region of the United States of America. Simply stated, the talent pool will be \[ \frac{\text{NE and MW region populations of age 20-29 males}}{\text{proportion of MLB players from NE and MW regions}} \times \text{adjustment}. \] Adjustments include the effects of changing interest in baseball, wars, segregation, and gradual integration of the MLB. In our analysis, the states that comprise the NE and MW regions are: Connecticut, Delaware, Illinois, Indiana, Iowa, Kansas, Maine, Massachusetts, Michigan, Minnesota, Nebraska, New Hampshire, New Jersey, New York, North Dakota, Ohio, Pennsylvania, Rhode Island, South Dakota, Vermont, and Wisconsin.
We load in the necessary software packages used in this analysis.
library(Lahman)
library(tidyverse)
library(ggplot2)
library(splines)
We load in the Batting and Pitching data sets in the Lahman R package and isolate the birth state and birth country for every MLB player in history (as recorded by the Lahman Database and implemented in the Lahman R package). Missing entries are discarded.
foo_bat = Batting %>% dplyr::select(playerID, yearID) %>%
left_join(People %>% dplyr::select(playerID, birthCountry, birthState))
foo_pit = Pitching %>% dplyr::select(playerID, yearID) %>%
left_join(People %>% dplyr::select(playerID, birthCountry, birthState))
## adjust for missingness
foo_tot = rbind(foo_bat, foo_pit) %>% distinct()
foo_tot[is.na(foo_tot$birthCountry), ]$birthCountry = "missing"
foo_tot[!is.na(foo_tot$birthCountry) &
is.na(foo_tot$birthState), ]$birthState = "missing"
We add a regions variable which will allow us to plot the proportion of MLB players from each region over time (0 will indicate an international player).
NMLB = foo_tot %>% group_by(yearID) %>%
summarise(N = n())
state_perc = foo_tot %>%
group_by(yearID, birthState, birthCountry) %>%
summarise(yearID, birthState, birthCountry, Nstate = n()) %>%
distinct() %>%
left_join(NMLB) %>%
mutate(state_perc = Nstate/N) %>%
mutate(region = ifelse(birthState %in% c("CT", "MA", "DE", "ME", "NH",
"NJ", "NY", "PA", "VT", "RI"), "NE", "0"),
region = ifelse(birthState %in% c("IL", "IN", "MI", "OH", "WI",
"IA", "KS", "MN", "MO", "NE", "ND", "SD"),
"MW", region),
region = ifelse(birthState %in% c("WA", "MT", "ID", "WY", "OR",
"CA", "NV", "UT", "CO", "AZ",
"NM"), "W", region),
region = ifelse(birthCountry == "USA" & region == "0", "S", region),
region = ifelse(region == "0", "INTL", region))
We now plot the proportion of MLB players from each region. From this we see that baseball was primarily a Northeastern and Midwestern game in the early years. We can also see that the game became more and more international beginning around integration in 1947. This is consistent with Armour and Levitt (2016).
region_perc = state_perc %>% group_by(yearID, region) %>%
summarise(region_perc = sum(state_perc))
## By region
ggplot(region_perc, aes(x = yearID, y = region_perc)) +
geom_line() +
facet_wrap(~region) +
labs(title = "MLB representation across regions",
y = "proportion",
x = "year") +
theme_minimal()
Even into the 1950s and early 1960s, baseball was primarily a NE and MW game, with the South on par with the MW region in this period. This balance of MLB regional representation coincides with, as recently as 1958, no MLB team residing further West or South than St Louis.
ggplot(region_perc %>%
mutate(NEpMW = ifelse(region %in% c("MW", "NE"),1,0)) %>%
filter(NEpMW == 1) %>%
group_by(yearID) %>%
summarise(NEplusMW_perc = sum(region_perc)),
aes(x = yearID, y = NEplusMW_perc)) +
geom_line() +
labs(title = "MLB representation from NE and MW",
y = "proportion",
x = "year") +
ylim(0,1) +
theme_minimal()
We now estimate the talent pool from three main sources: UN and US Census. The talent pool will be age 20-29 males as in . These sources record population tallies every 10 years. Before 1950, this population will consist of only white males to reflect the segregation of the MLB during that time period.
We first obtain US Census counts of aged 20-29 white males from 1870 to 1940, and we obtain US Census counts of all aged 20-29 males for 1950. Baseball began the integration process in 1947. Inclusion of the US Black population into the 1950 count likely overstates the participation of this population in the MLB at this time due to continued segregation (See Armour and Levitt (2016) for details of MLB demographics). The fraction of the US population that is a 20-29 year old eligible male is also recorded for later use.
## Links to resources working through January, 2023.
### US population before 1960 from US Census
#https://www2.census.gov/library/publications/decennial/1870/vital-statistics/1870b-33.pdf
US1870 = 1.26 + 1.33
totalM1870 = 19.49
total1870 = 38.56
propM1870 = US1870/totalM1870
prop1870 = US1870/total1870
#https://www2.census.gov/library/publications/decennial/1880/vol-01-population/1880_v1-15.pdf
US1880 = 2.22 + 1.84
totalM1880 = 25.52
total1880 = 50.12
propM1880 = US1880/totalM1880
prop1880 = US1880/total1880
#https://www2.census.gov/library/publications/decennial/1900/volume-2/volume-2-p5.pdf
## We did not find the relevant population totals for 1890
## We compute the average of 1880 and 1900 as an educated guess.
US1900 = 2.63 + 2.36
US1890 = mean(c(US1880,US1900))
total1900 = 75.99
total1890 = 62.22
#https://www2.census.gov/library/publications/decennial/1910/volume-1/volume-1-p6.pdf
US1910 = 4.07 + 3.79
totalM1910 = 47.33
total1910 = 91.97
propM1910 = US1910/totalM1910
prop1910 = US1910/total1910
#https://www2.census.gov/library/publications/decennial/1920/volume-2/41084484v2ch03.pdf
US1920 = 4.02 + 4.09
totalM1920 = 53.90
total1920 = 105.71
propM1920 = US1920/totalM1920
prop1920 = US1920/total1920
#https://www2.census.gov/library/publications/decennial/1930/population-volume-2/16440598v2ch11.pdf
US1930 = 4.67 + 4.25
totalM1930 = 55.17
total1930 = 108.86
propM1930 = US1930/totalM1930
prop1930 = US1930/total1930
#https://www2.census.gov/library/publications/decennial/1940/population-volume-4/33973538v4p1ch1.pdf
US1940 = 1.08 + 1.06 + 1.01 + 1.00 + 1.01 +
1.01 + 0.99 + 0.98 + 0.97 + 0.95
totalM1940 = 66.06
total1940 = 131.67
propM1940 = US1940/totalM1940
prop1940 = US1940/total1940
#https://www2.census.gov/library/publications/decennial/1950/population-volume-2/21983999v2p1ch2.pdf
#https://www2.census.gov/library/publications/decennial/1950/pc-14/pc-14-05.pdf
US1950 = 5.61 + 5.97
totalM1950 = 74.83
total1950 = 150.70
propM1950 = US1950/totalM1950
prop1950 = US1950/total1950
prop_eligible = data.frame(year = c(1870, 1880, 1910, 1920, 1930, 1940, 1950),
prop = c(prop1870, prop1880, prop1910, prop1920,
prop1930, prop1940, prop1950))
We now obtain UN counts of aged 20-29 males from 1960 on.
## US population from UN
population_data = read.csv("population.csv", header = TRUE)[, -1] %>%
mutate(age20 = age20 / 1e3, age25 = age25 / 1e3) %>%
mutate(pop = age20 + age25) %>%
dplyr::select("region", "year", "pop") %>%
filter(year >= 1950)
population_data$region = as.factor(population_data$region)
population_data$region = recode_factor(
population_data$region, WORLD = "world", Canada = "Can",
"United States of America" = "US")
foo = population_data %>% filter(region %in% "US") %>%
filter(year >= 1960)
We now estimate the talent pool from each geographical region using US Census data and linear interpolation for years in which we could not find data.
## Get USA population and populations of regions
#https://www.census.gov/data/tables/time-series/dec/popchange-data-text.html
#gsub(" \t|\n" ,", ", gsub(",","", c("57,609,148 55,317,240 53,594,378
# 50,809,229 49,135,283 49,040,703 44,677,819 39,477,986 35,976,777
# 34,427,091 29,662,053 25,868,573 68,985,454 66,927,001 64,392,776
# 59,668,632 58,865,670 56,571,663 51,619,139 44,460,762 40,143,332
# 38,594,100 34,019,792 29,888,542 126,266,107 114,555,744 100,236,820
# 85,445,930 75,372,362 62,795,367 54,973,113 47,197,088 41,665,901
# 37,857,633 33,125,803 29,389,330 78,588,572 71,945,553 63,197,932
# 52,786,082 43,172,490 34,804,193 28,053,104 20,189,962 14,379,119
# 12,323,836 9,213,920 7,082,086")))
mat = cbind(seq(2020,1910,by=-10),
matrix(c(57609148, 55317240, 53594378, 50809229, 49135283,
49040703, 44677819, 39477986, 35976777, 34427091,
29662053, 25868573, 68985454, 66927001, 64392776,
59668632, 58865670, 56571663, 51619139, 44460762,
40143332, 38594100, 34019792, 29888542, 126266107,
114555744, 100236820, 85445930, 75372362, 62795367,
54973113, 47197088, 41665901, 37857633, 33125803,
29389330, 78588572, 71945553, 63197932, 52786082,
43172490, 34804193, 28053104, 20189962, 14379119,
12323836, 9213920, 7082086), ncol = 4))
df = as.data.frame(mat)
colnames(df) = c("year","NE","MW","S","W")
df$total = rowSums(df[, 2:5])
#https://www2.census.gov/library/publications/decennial/1870/population/1870a-04.pdf
NE1870 = 540000 + 130000 + 630000 + 1460000 + 320000 + 910000 +
4380000 + 3520000 + 220000 + 330000
# "IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI",
MW1870 = 2540000 + 1680000 + 1190000 + 360000 + 1180000 + 440000 + 1720000 +
120000 + 2670000 + 1050000
#"AZ", "CA", "CO", "ID", "MT", "NM", "OR", "NV", "UT", "WA", "WY"
W1870 = 200000 + 560000 + 100000 + 100000
df = rbind(df, c(1900, NA, NA, NA, NA, total1900*1e6),
c(1890, NA, NA, NA, NA, total1890*1e6),
c(1880, NA, NA, NA, NA, total1880*1e6),
c(1870, NE1870, MW1870, total1870*1e6 - W1870 - NE1870 - MW1870, W1870, total1870*1e6))
df_prop = rbind(cbind(df[1:12, 1], df[1:12, 2:5] / df[1:12, 6]),
as.numeric(cbind(1870, df[16, 2:5] / df[16,6])))
colnames(df_prop)[1] = "year"
x1 = approx(c(1910,1870), tail(df_prop$NE,2), xout = seq(1900,1880,by=-10))$y
x2 = approx(c(1910,1870), tail(df_prop$MW,2), xout = seq(1900,1880,by=-10))$y
x3 = approx(c(1910,1870), tail(df_prop$S,2), xout = seq(1900,1880,by=-10))$y
x4 = approx(c(1910,1870), tail(df_prop$W,2), xout = seq(1900,1880,by=-10))$y
df_prop[13:15, 2] = x1
df_prop[13:15, 3] = x2
df_prop[13:15, 4] = x3
df_prop[13:15, 5] = x4
df_prop[16, ] = as.numeric(cbind(1870, df[16, 2:5] / df[16,6]))
df_prop[13:15, 1] = c(1900, 1890, 1880)
## check to make sure that the proportions sum to 1
rowSums(df_prop[, 2:5])
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
df[13:15, 2:5] = df_prop[13:15, 2:5] * df[13:15, 6]
We plot the estimated US population broken down by regions across years as well as the proportion of the estimated US population broken down by regions across years. We see that the US population has grown considerably over time, and that the US population has shifted from primarily NE and MW to primarily South and West as the population as grown with the inflection point occurring around 1975.
## proportion/totals of USA from each region
df_long = pivot_longer(df, cols = c("NE", "MW", "S", "W", "total"),
names_to = "region", values_to = "population")
df_long_inter = df_long %>% group_by(region) %>%
summarise(yearID = 1870:2020,
pop = approx(year, population, xout = 1870:2020)$y) %>%
group_by(yearID) %>%
summarise(region, pop, pop_perc = pop / sum(pop) * 2)
## total population of USA across regions
ggplot(df_long_inter, aes(x = yearID, y = pop, col = region)) +
geom_line() +
labs(title = "US population across regions",
y = "population",
x = "year") +
theme_minimal()
## total population of USA across regions
ggplot(df_long_inter %>% filter(region != "total") %>%
group_by(yearID) %>%
summarise("NE and MW" = sum(pop[region %in% c("NE","MW")]),
"South and West" = sum(pop[region %in% c("S","W")])) %>%
pivot_longer(cols = "NE and MW":"South and West", names_to = "region"),
aes(x = yearID, y = value, col = region)) +
geom_line() +
labs(title = "US population across regions",
subtitle = "MW and NE vs South and West",
y = "population",
x = "year") +
geom_vline(xintercept = 1975) +
theme_minimal()
## proportion of USA
ggplot(df_long_inter %>% filter(region != "total") %>%
group_by(yearID) %>%
summarise("NE and MW" = sum(pop_perc[region %in% c("NE","MW")]),
"South and West" = sum(pop_perc[region %in% c("S","W")])) %>%
pivot_longer(cols = "NE and MW":"South and West", names_to = "region"),
aes(x = yearID, y = value, col = region)) +
geom_line() +
labs(title = "US population across regions",
subtitle = "MW and NE vs South and West",
y = "proportion",
x = "year") +
geom_vline(xintercept = 1975) +
ylim(0,0.75) +
theme_minimal()
We show that a similar inflection point in the proportion of MLB representation happens in 1962. After 1962, the MLB is primarily represented by people from the South and West regions and international sources.
region_perc_pop = region_perc %>%
left_join(df_long_inter) %>%
filter(yearID <= 2020) %>%
group_by(yearID) %>%
summarise("NEpMW_MLB" = sum(region_perc[region %in% c("NE","MW")], na.rm = TRUE),
"SpW_MLB" = sum(region_perc[region %in% c("S","W")],
na.rm = TRUE),
"SpWpINTL_MLB" = sum(region_perc[!region %in% c("NE","MW")],
na.rm = TRUE),
"NEpMW_pop" = sum(pop[region %in% c("NE","MW")], na.rm = TRUE),
"SpW_pop" = sum(pop[region %in% c("S","W")], na.rm = TRUE),
"SpWpINTL_pop" = sum(pop[!region %in% c("NE","MW")], na.rm = TRUE),
"NEpMW_pop_prop" = sum(pop_perc[region %in% c("NE","MW")], na.rm = TRUE),
"SpW_pop_prop" = sum(pop_perc[region %in% c("S","W")], na.rm = TRUE))
region_perc_pop %>%
rename("NE and MW" = NEpMW_MLB,
"S, W, and INTL" = SpWpINTL_MLB) %>%
pivot_longer(cols = c("NE and MW","S, W, and INTL"),
names_to = "region") %>%
ggplot() +
aes(x = yearID, y = value, col = region) +
geom_vline(xintercept = 1962) +
geom_line() +
theme_minimal() +
labs(title = "MLB representation across regions",
y = "proportion",
x = "year") +
ylim(0,1)
We also see that from 1871-1925, the MLB was over-represented by people from the NE and MW regions. From 1925-1962, the MLB was nearly properly represented by people from the NE and MW regions. From 1962-present, the MLB was under-represented by people from the NE and MW regions.
region_perc_pop %>%
mutate(NEpMW_MLB = NEpMW_MLB/(NEpMW_MLB + SpW_MLB)) %>%
rename("NE and MW in MLB" = NEpMW_MLB,
"NE and MW in US" = NEpMW_pop_prop) %>%
pivot_longer(cols = c("NE and MW in MLB","NE and MW in US"),
names_to = "representation") %>%
ggplot() +
aes(x = yearID, y = value, col = representation) +
geom_line() +
theme_minimal() +
labs(title = "NE and MW representation for the MLB and the US",
y = "proportion",
x = "year") +
geom_vline(xintercept = 1925) +
geom_vline(xintercept = 1962) +
ylim(0,1)
We now estimate the talent pool (not yet adjusted for changing level interest). The main reasons for the break point in the early 1960s are:
Below we calculate and plot the proportion of the US population that belongs to the talent pool (age 20-29 males). We see a pronounced post World War II drop followed by a surge corresponding to when people from the Baby Boom generation (born between 1946 and 1964) started coming of age to play in the MLB.
prop_eligible = rbind(prop_eligible,
foo %>% left_join(df %>% dplyr::select(year, total)) %>%
mutate(prop = pop * 1e6/total) %>%
dplyr::select(year, prop))
# https://www.census.gov/data/tables/2020/demo/age-and-sex/2020-age-sex-composition.html
prop_eligible = rbind(prop_eligible,
c(2020, (10.573 + 11.682) / 325.268 ))
prop_eligible_long = data.frame(
year = 1870:2020,
prop_elig = approx(prop_eligible$year, prop_eligible$prop, xout = 1870:2020)$y
)
prop_eligible_long$prop_elig_smooth =
predict(lm(prop_elig ~ ns(year, df=13), data=prop_eligible_long))
ggplot(prop_eligible_long, aes(x = year, y = prop_elig)) +
ylim(0, 0.10) +
geom_line(aes(x = year, y = prop_elig_smooth)) +
labs(title = "Proportion of US population that is male age 20-29",
y = "proportion",
x = "year") +
theme_minimal() +
geom_vline(xintercept = 1944.5, lty = 2) +
geom_vline(xintercept = 1966, lty = 2, col = "red") +
geom_vline(xintercept = 1984, lty = 2, col = "blue") +
annotate("text", x = c(1937,1967,1985), y = c(0.03,0.02,0.01),
label = c("D-Day","20 years after start of Baby Boom",
"20 years after end of Baby Boom"),
color = c("black","red","blue"))
We now multiple the population of people from the NE and MW regions by the proportion of that population that is male age 20-29. This gives us an estimate of the talent pool that is not yet adjusted for external factors.
region_perc_pop_NEMW = region_perc_pop %>%
left_join(prop_eligible_long, by = c("yearID" = "year")) %>%
mutate(MLBeligibleNEMW = NEpMW_pop/NEpMW_MLB * prop_elig)
region_perc_pop_NEMW = region_perc_pop_NEMW[1:150, ]
region_perc_pop_NEMW %>%
filter(yearID %in% c(1871, seq(from = 1880, to = 2020, by = 10))) %>%
dplyr::select(yearID, MLBeligibleNEMW) %>% as.data.frame()
## yearID MLBeligibleNEMW
## 1 1871 2259662
## 2 1880 3167278
## 3 1890 4101846
## 4 1900 4783564
## 5 1910 6320357
## 6 1920 7929134
## 7 1930 11567837
## 8 1940 12102545
## 9 1950 12296708
## 10 1960 12539318
## 11 1970 20335683
## 12 1980 28402872
## 13 1990 27805051
## 14 2000 29086696
## 15 2010 49622223
## 16 2020 46715012
ggplot(region_perc_pop_NEMW, aes(x = yearID, y = MLBeligibleNEMW)) +
geom_line() +
labs(title = "Talent pool (not adjusted for interest)",
y = "population",
x = "year") +
theme_minimal()
We zoom in on the years between 1950 and 1975. We see that the proportion of the MLB that is from the NE and MW (\(\texttt{NEpMW_MLB_prop}\)) starts stable but starts to decline rapidly in the mid/late 1960s and into 1970s, and the proportion of the USA population that is an aged 20-29 male (\(\texttt{prop_elig}\)) increases dramatically with increases from 1960 to 1975.
region_perc_pop_NEMW %>% select(yearID, NEpMW_MLB,
prop_elig, MLBeligibleNEMW) %>%
rename("NEpMW_MLB_prop" = NEpMW_MLB) %>%
filter(yearID >= 1950, yearID <= 1975) %>%
as.data.frame()
## yearID NEpMW_MLB_prop prop_elig MLBeligibleNEMW
## 1 1950 0.5245283 0.07684141 12296708
## 2 1951 0.5174954 0.07552082 12429972
## 3 1952 0.5251799 0.07420024 12208525
## 4 1953 0.5187266 0.07287965 12314050
## 5 1954 0.5298507 0.07155906 12003975
## 6 1955 0.5058431 0.07023848 12513250
## 7 1956 0.4836957 0.06891789 13016248
## 8 1957 0.5026643 0.06759731 12451253
## 9 1958 0.4948276 0.06627672 12566871
## 10 1959 0.4973730 0.06495614 12414835
## 11 1960 0.4886957 0.06363555 12539318
## 12 1961 0.5032258 0.06484700 12529119
## 13 1962 0.4985755 0.06605845 13005652
## 14 1963 0.4561151 0.06726989 14614469
## 15 1964 0.4514286 0.06848134 15173424
## 16 1965 0.4444444 0.06969279 15830576
## 17 1966 0.4117647 0.07090423 17544396
## 18 1967 0.4136490 0.07211568 17925273
## 19 1968 0.4023669 0.07332713 18907216
## 20 1969 0.3945819 0.07453858 19774755
## 21 1970 0.3934040 0.07575002 20335683
## 22 1971 0.3804348 0.07733364 21517118
## 23 1972 0.3701456 0.07891725 22619037
## 24 1973 0.3654080 0.08050087 23424695
## 25 1974 0.3600465 0.08208448 24295650
## 26 1975 0.3744076 0.08366809 23867865
We see that MLB representation from MW and NE declines slightly with international and the Western states improving their MLB representation over 1950-1975.
## By region
ggplot(region_perc %>%
filter(yearID >= 1950, yearID <= 1975) %>%
filter(region %in% c("INTL", "NE", "W", "MW", "S")),
aes(x = yearID, y = region_perc)) +
geom_line() +
facet_wrap(~region) +
labs(title = "MLB representation across regions",
subtitle = "1950-1975",
y = "proportion",
x = "year") +
ylim(0,0.4) +
theme_minimal()
We will first adjust the talent pool for changes in baseball interest over time. We will use these sources:
The Harris and Gallup polling data on general baseball interest (are you a fan/follower of baseball) indicates that baseball interest has been relatively constant throughout the years, although it has deceased. However, the proportion of Americans who view baseball as their favorite sport has dramatically fallen. Note that Harris reports a higher proportion of the population whose favorite sport is baseball than Gallup. We will average the survey results. We will then define interest in baseball as the average of general baseball interest and those who consider baseball their favorite sport. This reflects the fact that some MLB players list other sports as their favorite (For example, Tony Gwynn loved basketball before baseball). Other aggregations will be considered in the Appendix (75% favorite sport is baseball and 25% general baseball interest; 100% favorite sport is baseball).
We could not find survey data about baseball interest prior to 1937. We will set baseball interest at 0.20, 0.20, 0.22, 0.30, and 0.38 respectively for years 1870, 1876, 1886, 1920, and 1930. We have endeavored to make reasonable choices for interest levels of young white adult males towards baseball over time, starting in the 1870s into the current era. This starting point of interest level for 1870 aims to balance the geographic area where the sport was played and increasing awareness from greater organization and media coverage of professional baseball. Indeed, our aim is looking at what conditions influence the interest of the generation of young white males that shape their fandom or even aspiration of becoming a ballplayer and how that increases and decreases over time. Thus, formation of the National League in 1876 would spur interest. So would the emergence of Babe Ruth in 1919-1920 not only as a home-run slugger but also as the well-covered “Sultan of Swat,” a media sensation whose popularity was aided by the newsprint media.
In a certain sense, we are generous in placing the 1870 interest level among the white male population interested in baseball at 20 percent. Baseball salaries were far lower in the past, media exposure was far less than it is in more modern eras, and baseball teams would disband with some frequency in the older eras. These early era realities may negatively effect one’s interest in pursuing a career in baseball. Our analysis does not make any explicit adjustments for changing salaries for MLB salaries, but it does explicitly account for increased competition from other sports. A thorough accounting for changing salaries is beyond the scope of this work, but it may have a considerable negative effect on the population pursuing a career in the MLB prior to more modern times. As recently as 1967, the MLB minimum salary was $6000 and the MLB average salary was $19,000 while the median household income was $7200. In 2022 the real median income is roughly $75,000 while the MLB minimum salary is $700,000 and the MLB average salary is roughly $4,000,000. Before 1967, detailed salary data becomes sparse. This article reports that the average MLB salary in 1920 was $5000 which is less than double the median household income at the time which was about $3300.
However, it can be said that baseball may be a source of social mobility for lower-class men even though MLB salaries were lower in the past. Riess (1980) studied professional baseball and social mobility in the Progressive era. He concluded that professional baseball was greatly overrated as a source of upward mobility when the game was at its unchallenged height, and that MLB jobs went mostly to well-educated, middle-class native American whites, Irishmen, and Germans who utilized the sport as a means of maintaining their social status and planned to use it as a ladder to future success. In summarizing Riess’s 1980 book Touching Base, Gerlach (1994) stated that the depiction of baseball as an “all-American game embodying all American virtues and values” was the prescriptive construct of a burgeoning baseball business geared to and supported by a native-born middle class.
It should be noted that there were more professional opportunities to play baseball in the past even though MLB salaries were lower, and that the relatively fewer professional opportunities that exist today may reflect a fundamental lack of interest in reaching the MLB that is not captured by surveys. In a study of structural mutualism–cooperative behavior among teams to the benefit of each due to the benefit of their membership in leagues–with respect to minor league baseball team and league foundings and disbandings, Land, Davis, and Blau (1994) charted the density of minor leagues and minor league teams. This chart clearly revealed a dramatic decrease in minor league operations after a pronounced post-WWII peak. The number of teams (leagues) was at 448 (59) in 1949, and fell to 132 (20) in 1964 (see here for details). As noted by Land, Davis, and Blau (1994), Sullivan’s 1990 book The Minors attributed this decline in the minor leagues to relocation of MLB teams to minor league markets and the spread of television broadcasting games. Bellamy and Walker (2004) noted that television went from 0.4 percent of U.S. households in 1948 to 87.1 percent in 1960. However, these authors concluded that the extension of MLB radio broadcasts, both national and local, in combination with a market correction, probably hurt the Minor Leagues, particularly in the first half of the 1950s, much more than the televised broadcasts that were in their infancy.
Land, Davis, and Blau (1994) stated that: “the single most important event in the history of the minors occurred in 1920s, with the implementation of the farm system…This system allowed the Cardinals to field extremely competitive teams without having to pay high prices to the minor-league clubs that owned the player’s contracts. The New York Yankees soon adopted this strategy and dominated the AL in the 1930s. It is not clear to what extent other teams wanted, or were able, to use this system during the 1930s, but it clearly was common.” Horowitz (1997) studied competitive balance in the MLB and speculated that “the owners, however, solely for business purposes, embraced expansion and allowed their lust for broadcast-rights revenues to hasten the erosion of the minor leagues and their farm systems.” However, footnote 9 in Land, Davis, and Blau (1994) stated that the majority of minor league teams that died were independent teams not affiliated with an MLB farm system.
In aggregate, it appears that collapse of professional baseball (proxied by the Minors) is a mere shuffling around and does not reflect some deep collapse in the talent pool. Thus, the rapid contraction of professional opportunities (proxied by the Minor Leagues) can be explained, albeit not completely, by factors (relocation of MLB teams to minor league markets, extensions of MLB radio and television broadcasts, a possible market correction, and structural mutualism) that do not necessarily signal a corresponding rapid decrease in interest in playing baseball professionally which would not be captured by surveys. Moreover, per game attendance has steadily risen over time with some peaks and valleys corresponding to historical events. That being said, in the Appendix we will consider an estimate of the talent pool in which interest in baseball is higher during the years preceding the erosion of the minor leagues.
MLBattendance = read_csv("MLBattendance.csv")
colnames(MLBattendance)[3] = "AttendPerGame"
MLBattendance %>%
ggplot() +
aes(x = Year, y = AttendPerGame) +
geom_line() +
theme_minimal() +
labs(title = "Attendance per game per year") +
geom_vline(xintercept = 1919, lty = 2) +
geom_vline(xintercept = 1928, lty = 2, col = "red") +
geom_vline(xintercept = 1944.5, lty = 2, col = "blue") +
geom_vline(xintercept = 1994.5, lty = 2, col = "brown") +
annotate("text", x = c(1899,1921,1938,1966,1978,2006.5),
y = c(9000,14000,19000,26000,30000,24000),
label = c("End of deadball era",
"Stock market crash",
"D-Day",
"Minor League decline",
"Expansion era",
"Labor strike"),
color = c("black","red","blue", "grey", "orange","brown")) +
annotate("rect", fill = "grey", alpha = 0.2,
xmin = 1949, xmax = 1960,
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "orange", alpha = 0.2,
xmin = 1961, xmax = 1999,
ymin = -Inf, ymax = Inf)
We now add our estimate of baseball interest. We will use interpolation for missing seasons. Smoothing will be applied. This smoothing helps with jaggedness and handles the jump in interest due to switching from Gallup to the average of Gallup and Harris. Historical interest in baseball is estimated and plotted below.
## Compute interest adjustment
#https://news.gallup.com/poll/4735/sports.aspx
#https://www.mprnews.org/story/2015/05/11/is-baseball-dying-no
favorite_Gallup = data.frame(
year = c(1937, 1948, 1960, 1972, 1981, 1990, 1994, 1998, 2000, 2001, 2002,
2003, 2004, 2005, 2006, 2007, 2008, 2013, 2014, 2017) + 10,
interest = c(0.34, 0.39, 0.34, 0.22, 0.16, 0.16, 0.18, 0.12, 0.13, 0.12, 0.12,
0.10, 0.10, 0.12, 0.11, 0.13, 0.10, 0.14, 0.14, 0.09)#/2 + 0.42/2
)
#Harris poll
#https://www.sportsbusinessjournal.com/Daily/Issues/2015/01/28/Research-and-Ratings/Harris-Poll.aspx
#https://www.businessinsider.com/most-popular-sports-in-the-us-2016-3
favorite_Harris = data.frame(year = c(1985, 1989, 1992, 1998, 2009, 2010,
2011, 2012, 2013, 2014, 2016) + 10,
interest = c(0.23, 0.19, 0.21, 0.18, 0.16, 0.17,
0.13, 0.16, 0.14, 0.16, 0.15))
## Combined information
favorite = data.frame(
year = 1947:2020,
interest = approx(favorite_Gallup$year, favorite_Gallup$interest, xout = 1947:2020)$y
)
favorite[favorite$year >= 1995, ]$interest =
approx(favorite_Gallup$year, favorite_Gallup$interest, xout = 1995:2020)$y / 2 +
approx(favorite_Harris$year, favorite_Harris$interest, xout = 1995:2020)$y / 2
#https://news.gallup.com/poll/6745/baseball-fan-numbers-steady-decline-may-pending.aspx
#https://onlinemasters.ohio.edu/blog/mlb-young-fans/
#https://boxingjunkie.usatoday.com/2021/06/harris-poll-boxing-more-popular-than-mma-hockey-tennis-golf
interest = data.frame(
year = c(1937,1952,2001,2021) + 10,
interest = c(0.62,0.62,0.50,0.49)
)
# 50% for favorite; 50% for general interest
MLB_interest = data.frame(
year = 1947:2020,
interest = (approx(favorite$year, favorite$interest, xout = 1947:2020)$y +
approx(interest$year, interest$interest, xout = 1947:2020,
rule = 2)$y)/2
)
# The existence of the MLB itself
MLB_interest = rbind(c(1870,0.20), c(1876, 0.20), c(1886, 0.22), c(1920,0.30),
c(1930, 0.38), MLB_interest)
dat_interest = data.frame(
year = 1870:2020,
interest = approx(MLB_interest$year, MLB_interest$interest, rule = 2,
xout = 1870:2020)$y
)
dat_interest$interest_smooth = predict(lm(interest ~ ns(year, df=12), data=dat_interest))
ggplot(dat_interest, aes(x = year, y = interest)) +
labs(title = "Estimated interest in baseball",
subtitle = "as a proportion of total US population",
y = "proportion",
x = "year") +
geom_line(aes(x = year, y = interest_smooth), lwd = 1) +
theme_minimal() +
ylim(c(0,0.60))
We now plot the talent pool adjusted for interest.
region_perc_pop_NEMW = region_perc_pop %>%
mutate(MLBeligibleNEMW = NEpMW_pop/NEpMW_MLB) %>%
left_join(prop_eligible_long, by = c("yearID" = "year")) %>%
left_join(dat_interest, by = c("yearID" = "year")) %>%
mutate(MLBeligibleNEMW_adj = MLBeligibleNEMW * interest_smooth * prop_elig)
region_perc_pop_NEMW = region_perc_pop_NEMW[1:150, ]
region_perc_pop_NEMW$MLBeligibleNEMW_adj_smooth =
round(predict(lm(MLBeligibleNEMW_adj ~ ns(yearID, df=13), data=region_perc_pop_NEMW)))
ggplot(region_perc_pop_NEMW, aes(x = yearID, y = MLBeligibleNEMW_adj)) +
geom_point() +
geom_line(aes(x = yearID, y = MLBeligibleNEMW_adj_smooth), color = "red") +
labs(title = "Estimated talent pool adjusted for interest",
y = "population",
x = "year") +
theme_minimal()
We take a closer look at the period from 1995-2010 which shows a dramatic increase in the talent pool. We see that the proportion of the population that is a member of the talent pool (\(\texttt{prop_elig}\)) has a sharp decline and modest rebound over this time period. Interest in baseball exhibits a modest decrease over these years. The main driver of this massive increase in the talent pool is due to the percentage of baseball players from the NE and MW region decreasing over this time period.
region_perc_pop_NEMW %>% dplyr::select(yearID, NEpMW_MLB, prop_elig, MLBeligibleNEMW_adj_smooth, interest) %>%
filter(yearID >= 1995, yearID <= 2010) %>%
as.data.frame()
## yearID NEpMW_MLB prop_elig MLBeligibleNEMW_adj_smooth interest
## 1 1995 0.3283319 0.07631334 9441303 0.3670918
## 2 1996 0.3163891 0.07496025 9508935 0.3633673
## 3 1997 0.3045414 0.07360717 9669798 0.3596429
## 4 1998 0.3187184 0.07225408 9932646 0.3559184
## 5 1999 0.3052109 0.07090099 10286646 0.3521939
## 6 2000 0.2821138 0.06954790 10714803 0.3526361
## 7 2001 0.2696721 0.06987314 11200119 0.3543282
## 8 2002 0.2594417 0.07019839 11725596 0.3560204
## 9 2003 0.2373984 0.07052363 12274237 0.3547959
## 10 2004 0.2365678 0.07084888 12829045 0.3535714
## 11 2005 0.2166532 0.07117412 13373022 0.3473469
## 12 2006 0.2045089 0.07149936 13889172 0.3411224
## 13 2007 0.1979656 0.07182461 14360495 0.3348980
## 14 2008 0.1859024 0.07214985 14769996 0.3286735
## 15 2009 0.1816746 0.07247509 15101080 0.3282444
## 16 2010 0.1793435 0.07280034 15348376 0.3278154
We can see that the decrease in the proportion of MLB players from the NE and MW regions has corresponded with large increases to the proportion of MLB players from the South and West regions as well as a large influx of foreign talent (as seen above).
ggplot(region_perc %>%
filter(yearID >= 1995, yearID <= 2010) %>%
filter(region %in% c("INTL", "NE", "S", "MW", "W")),
aes(x = yearID, y = region_perc)) +
geom_line() +
facet_wrap(~region) +
ylim(c(0, 0.35)) +
labs(title = "MLB representation across regions",
subtitle = "1995-2010",
y = "proportion",
x = "year") +
theme_minimal()
It is worth noting that MLB expansion and team relocations that happened from 1970 to 1998:
We look at a few states over this time period. Interestingly, we find that the proportion of MLB players from California declines with New York over this time frame. Florida has a modest U-shaped proportion of MLB players from the state. Both Texas and Washington have increases over this time period.
ggplot(state_perc %>%
filter(yearID >= 1995, yearID <= 2010) %>%
filter(birthState %in% c("FL", "CA", "NY", "TX", "WA")),
aes(x = yearID, y = state_perc)) +
geom_line() +
ylim(c(0,0.22)) +
facet_wrap(~birthState) +
labs(title = "MLB representation from select states",
subtitle = "1995-2010",
y = "proportion",
x = "year") +
theme_minimal()
Interestingly, the Florida increase extends into 2020 while Texas slides, New York rebounds, and California levels off.
ggplot(state_perc %>%
filter(yearID >= 2010, yearID <= 2020) %>%
filter(birthState %in% c("FL", "CA", "NY", "TX")),
aes(x = yearID, y = state_perc)) +
geom_line() +
ylim(c(0,0.22)) +
facet_wrap(~birthState) +
labs(title = "MLB representation from select states",
subtitle = "2010-2020",
y = "proportion",
x = "year") +
theme_minimal()
Overall, the proportion of MLB players from different regions stabilizes from 2010-2020.
## By region
ggplot(region_perc %>%
filter(yearID >= 2010, yearID <= 2020) %>%
filter(region %in% c("INTL", "NE", "S", "MW", "W")),
aes(x = yearID, y = region_perc)) +
geom_line() +
facet_wrap(~region) +
ylim(c(0, 0.45)) +
labs(title = "MLB representation across regions",
subtitle = "2010-2020",
y = "proportion",
x = "year") +
theme_minimal()
We now create a smaller data set that contains the talent pool at each year.
datMLBeligible = data.frame(
year = region_perc_pop_NEMW$yearID,
pop = region_perc_pop_NEMW$MLBeligibleNEMW_adj_smooth
)
We now adjust for the effects of wars (WWI, WWII, and Korean War). For WWI military service was minimal so we consider an interest level of 0.85 for the talent pool for years 1918 and 1919.
# WWI
datMLBeligible[datMLBeligible[, 1] %in% c(1918:1919), ]$pop =
datMLBeligible[datMLBeligible[, 1] %in% c(1918:1919), ]$pop * c(0.85,0.85)
This article states that roughly 100 MLB players served in the Korean War. We’ll use 0.85 as the interest level for both 1952 and 1953. We will also ignore a lingering talent drain due to the Korean War because military service was not nearly on Americans’ minds as it was in WWII.
# Korean WAR
datMLBeligible[datMLBeligible[, 1] %in% c(1952:1953), ]$pop =
datMLBeligible[datMLBeligible[, 1] %in% c(1952:1953), ]$pop *
c(0.85,0.85)
We make an extensive and detailed adjustment for WWII following Gary Bedingfield’s timeline. The steps we followed were to count the proportion of players who were active MLB players the year before they served. If a player did not have any post war stats and no credible case of returning to the MLB then they were removed from the MLB for one season. Players like Bobby Thomson who served in WWII but returned in their age 22 or younger season were not counted as MLB players who missed time due to WWII. This timeline says 7 and 18 MLB players served in WWII in 1941 and 1942, respectively. Unfortunately, Gary Bedingfield’s timeline stops in 1942. Afterwards, Gary states that there were 195, 340, and 384 MLB players serving in WWII in, respectively, 1943, 1944, and 1945. However, his count seems to include people who played in the MLB and retired or those that would go on to play in the MLB in addition to active MLB players who left the league for service in WWII. So we will calculate the number of people who only played in the MLB from 1943-1946 as a proxy for the number of people serving in the MLB during the WWII years using the \(\texttt{People}\) data set in the \(\texttt{Lahman}\) package. This approach yielded 90, 166, and 170 MLB players serving in WWII in, respectively, 1943, 1944, and 1945. The roster size was set at 25 players in 1941-1944 (400 total spots) and 30 players in 1945 (480 total spots). We then incorporate a two-year lingering effect to the talent pool in which we use the 1943 interest level for 1946 and the 1942 interest level for 1947. This gives interest levels from 1941-1947: 0.983 0.96 0.78 0.59 0.65 0.78 0.96.
# WWII
datMLBeligible[datMLBeligible[, 1] %in% c(1941:1947), ]$pop =
datMLBeligible[datMLBeligible[, 1] %in% c(1941:1947), ]$pop *
(1- (c(7,18,90,166,170,90,18) / c(400,400,400,400,480,400,400)))
datMLBeligible$pop = round(datMLBeligible$pop)
We display the talent pool adjusted for baseball interest and wars.
ggplot(datMLBeligible, aes(x = year, y = pop)) +
geom_line() +
labs(title = "Estimated talent pool adjusted for interest and war",
y = "population",
x = "year") +
theme_minimal()
Jackie Robinson broke the color barrier in the NL in 1947. Shortly after Larry Doby broke the color barrier in the AL. Our talent pool includes integration from 1951-1960. Armour (2007) showed that the pace of integration was different in the NL and AL, with the AL integrating at a slower pace. We now correct for this by an adjustment to the talent pool applied strictly to the AL. This adjustment will be:
\[ \text{AL integration adjustment} = \frac{\% \text{white win shares in NL}}{\% \text{white win shares in AL}}, \] where win shares is a metric developed by Bill James which allocates team wins to individual players (as stated in Armour (2007)), and percentage white win shares is the percentage of total win shares from the white players. The values for this adjustment are taken from Figures 5 and 7 in Armour (2007). We will assume that the leagues are both equally integrated when the gap in percentage white win shares closes in 1973.
## Figure 5
foo = approx(x = c(1947, 1960, 1965, 1972, 1986),
y = c(1, 0.90, 0.80, 0.70, 0.70),
xout = 1947:1974)
## Figure 7
bar = approx(x = c(1947, 1950, 1954, 1959, 1968, 1973, 1974),
y = c(0, 0.05, 0.05, 0.17, 0.16, 0.01, 0),
xout = 1947:1974)
integration_adjustment = data.frame(
year = foo$x, adjustment = (foo$y - bar$y)/foo$y )
datMLBeligible$ALpop = datMLBeligible$pop
datMLBeligible[datMLBeligible$year %in% 1947:1974, ]$ALpop =
round(datMLBeligible[datMLBeligible$year %in% 1947:1974, ]$ALpop *
integration_adjustment$adjustment)
We now depict our estimated talent pool for the NL (blue line) and the AL (red line).
ggplot(datMLBeligible %>%
rename(NL = pop, AL = ALpop) %>%
pivot_longer(cols = NL:AL, names_to = "league"),
aes(x = year, y = value, color = league)) +
geom_line() +
labs(title = "Estimated talent pool",
subtitle = "adjusted for interest, war, and different rates of integration",
y = "population") +
theme_minimal()
We overlay our estimated MLB population with key historical events.
ggplot(datMLBeligible %>%
rename(NL = pop, AL = ALpop) %>%
pivot_longer(cols = NL:AL, names_to = "league"),
aes(x = year, y = value, color = league)) +
geom_line() +
labs(title = "Estimated talent pool",
subtitle = "adjusted for interest, war, and different rates of integration",
y = "population") +
geom_vline(xintercept = 1919, lty = 2) +
geom_vline(xintercept = 1928, lty = 2, col = "red") +
geom_vline(xintercept = 1944.5, lty = 2, col = "blue") +
geom_vline(xintercept = 1994.5, lty = 2, col = "brown") +
annotate("text", x = c(1894,1903,1936,1960,1980,2010),
y = c(2.6e6,5e6,7e6,1.5e7,1.25e7,7.5e6),
label = c("End of deadball era",
"Stock market crash",
"D-Day",
"Minor League decline",
"Expansion era",
"Labor strike"),
color = c("black","red","blue", "grey", "orange","brown")) +
annotate("rect", fill = "grey", alpha = 0.2,
xmin = 1949, xmax = 1960,
ymin = -Inf, ymax = Inf) +
annotate("rect", fill = "orange", alpha = 0.2,
xmin = 1961, xmax = 1999,
ymin = -Inf, ymax = Inf) +
theme_minimal()
The code below (not run) saves our estimates of the MLB talent pool and baseball interest to disk. The \(\texttt{0.5_favorite_sport}\) part of the file name refers to our estimate of baseball interest to be 50% from surveys of people listing baseball as their favorite sport, and 50% from surveys of general interest in baseball. In the next section, we will show that this estimate of baseball interest yields an estimate of the talent pool that aligns with an interest-adjusted estimate of the talent pool from a combination of baseball-playing Latin American countries.
#write_csv(datMLBeligible, "datMLBeligible_0.5_favorite_sport.csv")
#write_csv(dat_interest, "dat_interest_favorite_0.5_sport.csv")
We now investigate whether our estimated talent pool aligns with the baseball interest adjusted talent pools of Dominican Republic, Venezuela, and Puerto Rico, the three countries that immediately trail US in total MLB players and are known to have a rich history of baseball.
foo_tot %>% pull(birthCountry) %>% table() %>% sort(., decreasing = TRUE) %>% head(4)
## .
## USA D.R. Venezuela P.R.
## 89030 4516 2457 1903
It is claimed that over 800,000 children in Dominican Republic play baseball (linked source published after 2010). We see that the number of age 5-14 male children in Dominican Republic is roughly 1 million.
## number of age 5-14 male children in DR (in thousands)
WPP = read_csv("WPP_UN.csv")
WPP %>%
filter(Date == 2010, Region == "Dominican Republic") %>%
dplyr::select(`5-9`, `10-14`) %>%
sum()
## [1] 1030.24
The Little League Baseball in Venezuela website contains survey data for baseball interest in Venezuela and in Latin America (detailed survey information published in 1998). According to this website, 74% of the boys said that they like to play baseball compared to 36% of girls. This reference has a breakdown by socio-economic status: 52% and 60% of Venezuelan kids in the top 10% of income liked to play and watch baseball respectively while those numbers increase to 67% and 73% for the poorest kids in the bottom 40% of income.
Our previous estimates of the talent pool from these two countries placed recent baseball interest in Dominican Republic and Venezuela at, respectively, 75% and 85%. Given the tally of baseball playing children in Dominican Republic and the gender breakdown from Venezuela, we will change the Dominican Republic interest to 90% (noting that the fan interest in baseball is likely higher than participation and this is missing for Dominican Republic). We now compute the baseball interest adjusted population from Dominican Republic and Venezuela.
We will specify the baseball interest in Puerto Rico to be 82.5% (halfway between interest in Venezuela and Dominican Republic). We base this estimate on this article which shows that Puerto Rico is first in attendance per game per million people in the population and is competitive with, but trailing, Dominican Republic and Venezuela in baseball related internet searches for baseball.
# https://egrove.olemiss.edu/cgi/viewcontent.cgi?article=1750&context=hon_thesis
# Wasch (2009) study
# https://static1.squarespace.com/static/5914832ae58c629214e9fed7/t/591c9b62f5e231ade75a7ad8/1495047010650/Fun+Facts.pdf
# https://www.godominicanrepublic.com/
# Over 800,000 children play organized baseball in DR
population_data$region = recode_factor(
population_data$region, "Dominican Republic" = "DR")
foo = population_data %>% filter(region %in% c("DR")) %>%
mutate(pop = pop * 0.95)
#https://www.worldometers.info/world-population/dominican-republic-population/
# DR population in 2022 divided by DR population in 2010
# population estimate taken January 16, 2023
foo[8, 1] = foo[7, 1]; foo[8, 2:3] = c(2022, foo[7,3] * 11.12 / 9.7)
bar = spread(foo, region, pop)
baz = approx(x = bar$year, y = bar$DR, xout = 1950:2022, n = 71)
datDR = data.frame(year = baz$x, pop = baz$y)
population_data$region = recode_factor(
population_data$region, "Venezuela (Bolivarian Republic of)" = "Ven")
foo = population_data %>% filter(region %in% c("Ven"))
#https://www.worldometers.info/world-population/venezuela-population/
# Ven population in 2022 divided by Ven population in 2010
# population estimate taken January 16, 2023
foo[8, 1] = foo[7, 1]; foo[8, 2:3] = c(2022, foo[7,3] * 28.232 / 28.44)
bar = spread(foo, region, pop)
baz = approx(x = bar$year, y = bar$Ven, xout = 1950:2022)
datVen = data.frame(year = baz$x, pop = baz$y)
# http://www.zonalatina.com/Zldata134.htm
# 1998: 74% of boys play baseball; 71% watch baseball
# most popular among poorest kids (liked to play baseball: top 10%: 52%; bottom 40%: 67%)
datVen[datVen$year >= 1967, 2] = 0.75 * datVen[datVen$year >= 1967, 2]
datVen[datVen$year < 1967, 2] = 0.40 * datVen[datVen$year < 1967, 2]
population_data$region = recode_factor(
population_data$region, "Puerto Rico" = "PR")
foo = population_data %>% filter(region %in% c("PR"))
#https://www.worldometers.info/world-population/puerto-rico-population/
# Puerto Rico population in 2022 divided by PR population in 2010
# population estimate taken March 01, 2024
foo[8, 1] = foo[7, 1]; foo[8, 2:3] = c(2022, foo[7,3] * 3.252 / 3.72)
bar = spread(foo, region, pop)
baz = approx(x = bar$year, y = bar$PR, xout = 1950:2022)
datPR = data.frame(year = baz$x, pop = baz$y)
# http://www.zonalatina.com/Zldata134.htm
# 1998: 74% of boys play baseball; 71% watch baseball
# most popular among poorest kids (liked to play baseball: top 10%: 52%; bottom 40%: 67%)
datPR[, 2] = 0.825 * datPR[, 2]
We now plot the baseball-interested adjusted population from Dominican Republic and Venezuela (red line) and the talent pool multiplied by the percentage of representation from these two countries (blue line). We note that these two quantities are in general agreement after the rapid influx of talent from these two countries leveled off.
## DR and Venezuela and PR
datLA = data.frame(
year = datDR$year,
pop = datDR$pop + datVen$pop + datPR$pop
) %>% filter(year >= 1980)
bazcombined = datMLBeligible %>% filter(year >= 1980)
bazcombined$perc_MLB_LA =
foo_tot %>% group_by(yearID) %>%
summarise(perc_MLB_LA =
sum(ifelse(birthCountry %in% c("D.R.", "Venezuela", "P.R."),1,0))/n()) %>%
filter(yearID >= 1980, yearID <= 2020) %>% pull(perc_MLB_LA)
bazcombined = bazcombined %>% mutate(pop_MLB_LA = pop*perc_MLB_LA) %>%
dplyr::select(-pop)
bazcombined$LApop_interest = datLA %>% filter(year <= 2020) %>% pull(pop) * 1e6
ggplot(bazcombined %>%
rename(LA_eligible_population = LApop_interest,
MLB_representation_from_LA = pop_MLB_LA) %>%
pivot_longer(cols = c(LA_eligible_population,
MLB_representation_from_LA),
names_to = "population")) +
aes(x = year, y = value, color = population) +
geom_line() +
ylim(0,4e6) +
labs(title = "DR, Venezuela, and PR combined eligible populations",
ylab = "population") +
theme_minimal()
Separately, Dominican Republic exhibits over-representation in the MLB while Venezuela exhibits under-representation (see below). Both of these findings are pronounced but are hardly surprising considering the extensive development of baseball academies in Dominican Republic and Puerto Rico and the relations between USA and Venezuela (Reuters (2019), AP News (2019), Reuters (2012), and ESPN (2007)). There is also over-representation in Puerto Rico, but this over-representation is not consequential given its magnitude. Assuming that the under-representation in Venezuela cancels with the over-representation of Dominican Republic and Puerto Rico in the MLB cancels for reasons not due to mismeasured interest or country-specific talent differences, then our construction of the talent pool is in balance with the interest-adjusted talent pool from Dominican Republic and Venezuela after the influx of MLB talent from those countries is fully realized. This supports our definition of the talent pool and our adjustments, particularly our adjustment for changing baseball interest.
## DR
bazDR = datMLBeligible %>% filter(year >= 1980)
bazDR$perc_MLB_DR =
foo_tot %>% group_by(yearID) %>%
summarise(perc_MLB_DR =
sum(ifelse(birthCountry %in% c("D.R."),1,0))/n()) %>%
filter(yearID >= 1980, yearID <= 2020) %>% pull(perc_MLB_DR)
bazDR = bazDR %>% mutate(pop_MLB_DR = pop*perc_MLB_DR) %>%
dplyr::select(-pop)
bazDR$DRpop_interest = datDR %>% filter(year >= 1980,year <= 2020) %>%
pull(pop) * 1e6
## Ven
bazVen = datMLBeligible %>% filter(year >= 1980)
bazVen$perc_MLB_Ven =
foo_tot %>% group_by(yearID) %>%
summarise(perc_MLB_Ven =
sum(ifelse(birthCountry %in% c("Venezuela"),1,0))/n()) %>%
filter(yearID >= 1980, yearID <= 2020) %>% pull(perc_MLB_Ven)
bazVen = bazVen %>% mutate(pop_MLB_Ven = pop*perc_MLB_Ven) %>%
dplyr::select(-pop)
bazVen$Venpop_interest = datVen %>% filter(year >= 1980, year <= 2020) %>%
pull(pop) * 1e6
## PR
bazPR = datMLBeligible %>% filter(year >= 1980)
bazPR$perc_MLB_PR =
foo_tot %>% group_by(yearID) %>%
summarise(perc_MLB_PR =
sum(ifelse(birthCountry %in% c("P.R."),1,0))/n()) %>%
filter(yearID >= 1980, yearID <= 2020) %>% pull(perc_MLB_PR)
bazPR = bazPR %>% mutate(pop_MLB_PR = pop*perc_MLB_PR) %>%
dplyr::select(-pop)
bazPR$PRpop_interest = datPR %>% filter(year >= 1980, year <= 2020) %>%
pull(pop) * 1e6
rbind(
bazcombined %>% select(year, pop_MLB_LA, LApop_interest) %>%
rename("MLB representation" = pop_MLB_LA, "eligible pop" = LApop_interest) %>%
mutate(region = "combined"),
bazDR %>% select(year, pop_MLB_DR, DRpop_interest) %>%
rename("MLB representation" = pop_MLB_DR, "eligible pop" = DRpop_interest) %>%
mutate(region = "DR"),
bazVen %>% select(year, pop_MLB_Ven, Venpop_interest) %>%
rename("MLB representation" = pop_MLB_Ven, "eligible pop" = Venpop_interest) %>%
mutate(region = "Ven"),
bazPR %>% select(year, pop_MLB_PR, PRpop_interest) %>%
rename("MLB representation" = pop_MLB_PR, "eligible pop" = PRpop_interest) %>%
mutate(region = "PR")
) %>%
pivot_longer(cols = "MLB representation":"eligible pop", names_to = "population_label") %>%
ggplot() +
aes(x = year, y = value, color = population_label) +
geom_line() +
facet_wrap(~region) +
theme_minimal() +
labs(title = "Representation in the MLB vs estimated talent pool size",
subtitle = "for DR, Venezuela, and PR",
y = "population",
x = "year")
Pursuit of fairness in gauging the talent pool for different eras has involved connecting the population data and the history of the major sports in the United States. First there is baseball’s geographic spread out of the North East and Midwest regions. This spread led to the sport’s increasing organization as local clubs, state associations, and “national” leagues were formed. Another variables that factored into how we gauge the player eligible population is the pace of globalization of baseball’s talent pool, and different rates of integration in the NL and AL. We adjusted our estimate of the talent pool to account for competition for athletic talent that baseball faced from other sports. Short term contractions of the talent pool due to wars were also accounted for.
We consider an estimate of the talent pool at different levels of baseball interest. Note that our estimate above sets baseball interest as 50% from surveys of people listing their favorite sport and 50% from general baseball interest. Here we consider:
75% from surveys of people listing their favorite sport and 25% from general baseball interest.
100% from surveys of people listing their favorite sport and 0% from general baseball interest.
These estimates of the talent pool do not align with the interest-adjust eligible populations of Latin American countries when combined. We will also consider estimates of the talent pool such that:
The talent pool is constant throughout time (1 million people).
The talent pool contracts sharply following the erosion of the minor leagues. For this estimate, we will suppose that baseball interest is at 90% for every season up to 1949 (the peak of the minor leagues in terms of number of leagues and teams) and will decrease to the interest level in our original analysis in 1964 (the season in which the collapse stops).
# 75% for favorite; 25% for general interest
MLB_interest = data.frame(
year = 1947:2020,
interest = (approx(favorite$year, favorite$interest, xout = 1947:2020)$y * 0.75 +
approx(interest$year, interest$interest, xout = 1947:2020,
rule = 2)$y * 0.25)
)
# The existence of the MLB itself
MLB_interest = rbind(c(1870,0.20), c(1876, 0.20), c(1886, 0.22), c(1920,0.30),
c(1930, 0.38), MLB_interest)
#write_csv(datMLBeligible, "datMLBeligible_0.75_favorite_sport.csv")
#write_csv(dat_interest, "dat_interest_favorite_0.75_sport.csv")
MLB_interest = favorite
# The existence of the MLB itself
MLB_interest = rbind(c(1870,0.20), c(1876, 0.20), c(1886, 0.22), c(1920,0.30),
c(1930, 0.38), MLB_interest)
#write_csv(datMLBeligible, "datMLBeligible_favorite_sport.csv")
#write_csv(dat_interest, "dat_interest_favorite_sport.csv")
# 100% for favorite; 0% for general interest
MLB_interest = favorite
# The existence of the MLB itself
MLB_interest = rbind(c(1870,0.20), c(1876, 0.20), c(1886, 0.22), c(1920,0.30),
c(1930, 0.38), MLB_interest)
#write_csv(datMLBeligible, "datMLBeligible_favorite_sport.csv")
#write_csv(dat_interest, "dat_interest_favorite_sport.csv")
# 50% for favorite; 50% for general interest
MLB_interest = data.frame(
year = 1947:2020,
interest = (approx(favorite$year, favorite$interest, xout = 1947:2020)$y +
approx(interest$year, interest$interest, xout = 1947:2020,
rule = 2)$y)/2
)
# Interest level = 0.9 before 1950
MLB_interest = rbind(cbind("year" = 1870:1949, "interest" = 0.9),
MLB_interest[MLB_interest$year >= 1964, ])
dat_interest = data.frame(
year = 1870:2020,
interest = approx(MLB_interest$year, MLB_interest$interest, rule = 1,
xout = 1870:2020)$y
)
#write_csv(datMLBeligible, "datMLBeligible_erosion.csv")
#write_csv(dat_interest, "dat_interest_erosion.csv")